High-Dimensional Inference with the generalized Hopfield Model: 
Principal Component Analysis and Corrections 
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We consider the problem of inferring the interactions between a set of N binary variables from the 
knowledge of their frequencies and pairwise correlations. The inference framework is based on the 
Hopfield model, a special case of the Ising model where the interaction matrix is defined through a 
set of patterns in the variable space, and is of rank much smaller than N. We show that Maximum 
Likelihood inference is deeply related to Principal Component Analysis when the amplitude of the 
pattern components, £, is negligible compared to yN. Using techniques from statistical mechanics, 
we calculate the corrections to the patterns to the first order in £/y~N. We stress that it is important 
to generalize the Hopfield model and include both attractive and repulsive patterns, to correctly 
infer networks with sparse and strong interactions. We present a simple geometrical criterion to 
decide how many attractive and repulsive patterns should be considered as a function of the sampling 
noise. We moreover discuss how many sampled configurations are required for a good inference, as 
a function of the system size N and of the amplitude £. The inference approach is illustrated on 
synthetic and biological data. 



I. INTRODUCTION 

Understanding the patterns of correlations between the 
components of complex systems is a fundamental issue in 
various scientific fields, ranging from neurobiology to ge- 
nomic, from finance to sociology, ... A recurrent problem 
is to distinguish between direct correlations, produced by 
physiological or functional interactions between the com- 
ponents, and network correlations, which are mediated 
by other, third-party components. Various approaches 
have been proposed to infer interactions from correla- 
tions, exploiting concepts related to statistical dimen- 
sional reduction [lj, causality @, the maximum entropy 
principle Q, Markov random fields Q ... A major prac- 
tical and theoretical difficulty in doing so is the paucity 
and the quality of data: reliable analysis should be able 
to unveil real patterns of interactions, even if measures 
are affected by under- or noisy sampling. The size of 
the interaction network can be comparable to or larger 
than the number of data, a situation referred to as high- 
dimensional inference. 

The purpose of the present work is to establish a quan- 
titative correspondence between two of those approaches, 
namely the inference of Boltzmann Machines (also called 
Ising model in statistical physics and undirected graph- 
ical models for discrete variables in statistical inference 
(H) and Principal Component Analysis (PCA) In- 
verse Boltzmann Machines (BM) are a mathematically 
well-founded but computationally challenging approach 
to infer interactions from correlations. Our scope is to 
find the interactions among a set of N variables cr = 
{<7i, (72, ... , ctat}. For simplicity, we consider variables Oi 
taking binary values ±1 only; the discussion below can 
be easily extended to the case of a larger number of val- 
ues, e.g. to genomics where nucleotides are encoded by 
four-letter symbols, or to proteomics where amino-acids 
can take twenty values. Assume that the average values 



of the variables, mi — (ci), and the pairwise correla- 
tions, Cij = (<Ji<Jj) are measured, for instance, through 
the sampling of, say, B configurations cr b ,b = 1, . . . , B. 
Solving the inverse BM problem consists in finding the 
set of interactions, Jij, and of local fields, hi, defining an 
Ising model, such that the equilibrium magnetizations 
and pairwise correlations coincide with, respectively, 
and c^. Many procedures have been designed to tackle 
this inverse problem, including learning algorithms [5j, 
advanced mean-field techniques @, B, message-passing 
procedures [1, |f|, cluster expansions flfl [Tl| . graphical 
lasso [H and its variants [12]. The performance (accu- 
racy, running time) of those procedures depend on the 
structure of the underlying interaction network and on 
the quality of the sampling, i.e. how large B is. 

Principal Component Analysis (PCA) is a widely pop- 
ular tool in statistics to analyze the correlation structure 
of a set of variables cr = {a±, a 2, ■ ■ ■ , cat}- The principle 
of PCA is simple. One starts with the correlation matrix, 
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which expresses the covariance between variables <jj and 
j , rescaled by the product of the expected fluctuations 
of the variables taken separately. T is then diagonal- 
ized. The projections of cr along the top eigenmodes 
(associated to the largest eigenvalues of T) identify the 
uncorrelated variables which contribute most to the to- 
tal variance. If a few, say, p (<C N), eigenvalues are 
notably larger than the remaining ones PCA achieves an 
important dimensional reduction. The determination of 
the number p of components to be retained is a delicate 
issue. It may be done by comparing the spectrum of 
r to the Marcenko-Pastur (MP) spectrum for the null 
hypothesis, that is, for the correlation matrix calculated 
from the sampling of B configurations of N independent 
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variables [l3| . Generally those two spectra coincide when 
N is large, except for some large or small eigenvalues of 
r, retained as the relevant components. 

The advantages of PCA are multiple, which explains 
its success. The method is very versatile and fast as it 
only requires to diagonalize the correlation matrix, which 
can be achieved in a time polynomial in the size N of the 
problem. In addition, PCA may be extended to incor- 
porate prior information about the components, which is 
particularly helpful for processing noisy data. An illus- 
tration is sparse PCA, which looks for principal compo- 
nents with many vanishing entries fl4j [. 

In this paper we present a conceptual and practical 
framework which encompasses BM and PCA in a con- 
trolled way. We show that PCA, with appropriate mod- 
ifications, can be used to infer BM and discuss in detail 
the amount of data necessary to do so. Our framework is 
based on an extension of a celebrated model of statistical 
mechanics, the Hopfield model [15|]. The Hopfield model 
was originally introduced to model auto-associative mem- 
ories, and relies on the notion of patterns [l6j . Informally 
speaking, a pattern £ = (£1, . . . , £jv) defines an attractive 
direction in the TV-dimensional space of the variable con- 
figurations, i.e. a direction along which cr has a tendency 
to align. The norm of £ characterizes the strength of the 
attraction. While having only attractive patterns makes 
sense for auto-associative memories, it is an unnecessary 
assumption in the context of BM. We therefore general- 
ize the Hopfield model by including repulsive patterns £, 
that is, directions in the A^-dimensional space which cr 
tends to be orthogonal to [l7|. From a technical point 
of view, the generalized Hopfield model with p attractive 
patterns and p repulsive patterns is simply a particular 
case of BM with an interaction matrix, J, of rank equal 
to p + p. If one knows a priori that the rank of the true 
J is indeed small, i.e. p + p -C N , using the generalized 
Hopfield model rather than a generic BM allows one to 
infer much less parameters and to avoid ovcrfitting in the 
presence of noisy data. 

We first consider the case where the components & and 
£i are very small compared to y/N. In this limit case we 
show that Maximum Likelihood (ML) inference with the 
generalized Hopfield model is closely related to PCA. The 
attractive patterns are in one-to-one correspondence with 
the largest components of the correlation matrix, while 
the repulsive patterns correspond to the smallest com- 
ponents, which are normally discarded by PCA. When 
all patterns are selected (p + p — N) inference with the 
generalized Hopfield model is equivalent to the mean- 
field approximation 6]. Retaining only few significative 
components helps, in principle, to remove noise from the 
data. We present a simple geometrical criterion to decide 
in practice how many attractive and repulsive patterns 
should be considered. We also address the question of 
how many samples (B) are required for the inference to 
be meaningful. We calculate the error bars over the pat- 
terns due to the the finite sampling. We then analyze 
the case where the data are sampled from a generalized 



Hopfield model, and inference amounts to learn the pat- 
terns of that model. When the system size, N, and the 
number of samples, B, are both sent to infinity with a 
fixed ratio, a = -j^, there is a critical value of the ratio, 
a c , below which learning is not possible. The value of 
a c depends on the amplitude of the pattern components. 
This transition corresponds to the retarded learning phe- 
nomenon discovered in the context of supervised learn- 
ing with continuous variables and rigorously studied in 
random matrix and probability theories, see [l3|, HH, [l9[ 
for reviews. We validate our findings on synthetic data 
generated from various Ising models with known inter- 
actions, and present applications to ncurobiological and 
proteomic data. 

In the case of a small system size, N, or of very strong 
components, the ML patterns do not coincide with 

the components identified by PCA. We make use of tech- 
niques from the statistical mechanics of disordered sys- 
tems originally intended to calculate averages over en- 
sembles of matrices to compute the likelihood to the sec- 
ond order in powers of jL. for a given correlation matrix. 
We give explicit expressions for the ML patterns in terms 
of non-linear combinations of the eigenvalues and eigen- 
vectors of the correlation matrix. These corrections are 
validated on synthetic data. Furthermore, we discuss the 
issue of how many sampled configurations are necessary 
to improve over the leading-order ML patterns as a func- 
tion of the amplitude of the pattern components and of 
the system size. 

The plan of the paper is as follows. In Section |H] we 
define the generalized Hopfield model, the Bayesian in- 
ference framework and list our main results, that is, the 
expressions of the patterns without and with corrections, 
the criterion to decide the number of patterns, and the 
expressions for the error bars on the inferred patterns. 
Tests on synthetic data are presented in Section Hill Sec- 
tion IIVI is devoted to the applications to real biological 
data, i.e recordings of the neocortical activity of a be- 
having rat and consensus multi-sequence alignment of 
the PDZ protein domain family. Readers interested in 
applying our results rather than in their derivation need 
not read the subsequent sections. Derivation of the log- 
likelihood with the generalized Hopfield model and of the 
main inference formulae can be found in Section [V] In 
Section I VII we study the minimal number B of samples 
necessary to achieve an accurate inference, and how this 
number depends on the number of patterns and on their 
amplitude. Perspectives and conclusions are given in Sec- 
tion En 



II. DEFINITIONS AND MAIN RESULTS 
A. Generalized Hopfield Model 

We consider configurations cr = {a\, , o~2, ■ ■ ■ , o~n} of N 
binary variables taking values Oi = ±1, drawn according 
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to the probability 

P r^-IV. sen {fi^l] - exp-^to-.h.OT.^}] (9) 

ft '1b,« }.{« M z|MO , {ni ■ < 2 » 

where the energy E is given by 

JV p / N \ 2 

E[a,h,{e},{t}} = -E^-^E 

i=l ju=l \i=l / 

Ji=l \i=l / 



(3) 



The partition function Z in ([2]) ensures the normaliza- 
tion of Ph. The components of h = (hi, h%, ■ ■, /i/v) are 
the local fields acting on the variables. The patterns 

^ = Ui .fsn ■ • ■ with /" = are attrac- 

tive patterns: they define preferred directions in the con- 
figuration space er, along which the energy E decreases 
(if the fields are weak enough). The patterns £ , with 
/i = 1,2,... ,p, are repulsive patterns: configurations cr 
aligned along those directions have a larger energy. The 
pattern components, £f,£f, and the fields, hi, are real- 
valued. Our model is a generalized version of the original 
Hopfield model [HI, which has only attractive patterns 
and corresponds to p = 0. In the following, we will as- 
sume that p + p is much smaller than N. 

Energy function ([3]) implicitly defines the coupling Jy 
between the variables iTj and Oj , 
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(4) 



Note that any interaction matrix can be written un- 
der the form with p and p being, respectively, the 
number of positive and negative eigenvalues of J. Here, 
we assume that the total number of patterns, p + p, i.e. 
the rank of the matrix J is (much) smaller than the sys- 
tem size, N. 

The data to be analyzed consists of a set of B config- 
urations of the N spins, cr b , b = 1, . . . , B. We assume 
that those configurations are drawn, independently from 
each other, from the distribution Pjj $2$. The parame- 
ters defining Pg, that is, the fields h and the patterns 
} are unknown. Our scope is to determine the 
most likely values for those fields and patterns from the 
data. In Bayes inference framework the posterior distri- 
bution for the fields and the patterns given the data {cr b } 



p[h,{e},{t}\w b }} = 



Po[h,m,{'t}} 

PilW b }} 



(5) 



l[p H [<T h \h,{e},{t}], 



6=1 



where Pq encodes some a priori information over the pa- 
rameters to be inferred and Pi is a normalization. 



It is important to realize that many transformations af- 
fecting the patterns can actually leave the coupling ma- 
trix J J3J and the distribution Ph unchanged. A sim- 
ple example is given by an orthogonal transformation O 
over the attractive patterns : £f — > £f = ^ n/ O liV ^. 
This invariance entails that the the problem of inferring 
the patterns is not statistically consistent: even with an 
infinite number of sampled data no inference procedure 
can distinguish between a Hopfield model with patterns 
and another one with patterns {£ }. However, the 
inference of the couplings is statistically consistent: two 
distinct matrices J define two distinct distributions over 
the data. 

In the presence of repulsive patterns the complete in- 
variance group is the indefinite orthogonal group 0(p,p), 
which has h(p + p)(p + p — 1) generators. To select one 
particular set of most likely patterns, we explicitly break 
the invariance through Pq. A convenient choice we use 
throughout this paper is to impose that the weighted dot 
products of the pairs of attractive and/or repulsive pat- 
terns vanish: 
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— p(p — 1) constraints 



pp constraints 



Tj-fK-P ~ 1) constraints 



(6) 



In the following we will use the vocable Maximum Like- 
lihood inference to refer to the case where the prior P is 
used to break the invariance only. Pq may also be chosen 
to impose specific constraints on the pattern amplitude, 
see Section Hi El devoted to regularization. 



B. Maximum Likelihood Inference: lowest order 

Due to the absence of three- or higher order-body in- 
teractions in E ([3]), P depends on the data {cr 6 } only 
through the N magnetizations, mt, and the ^N(N — 1) 
two-spin covariances, cy, of the sampled data: 
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We consider the correlation matrix T ([T]), and call A 1 > 
. . . > A fc > A fe+1 > . . . > X N its eigenvalues. v k de- 
notes the eigenvector attached to A fc and normalized to 
unity. We also introduce another notation to label the 
same eigenvalues and eigenvectors in the reverse order: 
X k = X N+1 ~ k and v k = v^ 1 ^, e.g. A 1 is the small- 
est eigenvalue of T; the motivation for doing so will be 
transparent below. Note that T is, by construction, a 
semi-definite positive matrix: all its eigenvalues are pos- 
itive. In addition, the sum of the eigenvalues is equal 
to N since Tu = l,Vi. Hence the largest and smallest 
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eigenvalues are guaranteed to be, respectively, larger and 
smaller than unity. 

In the following Greek indices, i.e. /i,is,p, correspond 
to integers comprised between 1 and p or p, while roman 
letters, i.e. i,j,k denote integers ranging from 1 to N. 

Finding the patterns and fields maximizing P ((5]) is a 
very hard computational task. We introduce an approx- 
imation scheme for those parameters 



CM 

hi 



(8) 



The derivation of this systematic approximation scheme 
and the discussion of how smaller the contributions get 
with the order of the approximation can be found in Sec- 
tion IV Al To the lowest order the patterns are given by 



0\M 



>N 



1 



where T _1 denotes the inverse matrix of T ([TJ. However, 
while all the eigenmodes of T are taken into account in the 
MF interactions (fT2"j). our lowest-order interactions (JTUJ) 
include contributions from the p largest and the p small- 
est eigenmodes only. As the values of p, p can be chosen 
depending on the number of available data, the general- 
ized Hopfield interactions (1101) is a priori less sensitive to 
overfitting. In particular, it is possible to avoid consider- 
ing the bulk part of the spectrum of T, which is essentially 
due to undersampling ([13| and Section IVIB 2p . 



C. Sampling error bars on the patterns 



The posterior distribution P can locally be approxi- 
mated with a Gaussian distribution centered in the most 
likely values for the patterns, {(£ Y}> an( i the 

fields, h°. We obtain the covariance matrix of the flue- 



(1 < fj, < p\ (9) tuations of the patterns around their most likely values, 
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X" 



- 1 



(i < it < p) ■ 



(Atf A#> = 



N [M 



i //// 



B^(l-m?)(l-mp 



(13) 



The above expressions require that A M > 1 for an attrac- 
tive pattern and A M < 1 for a repulsive pattern. Once 
the patterns are computed the interactions, {J°)ij, can 
be calculated from 



and identical expressions for (A£f A£J) and (A£f A£J) 
upon substitution of [M^] .. with, respectively, 
[M ? |]^ and [M||]^. The entries of the M matrices 



(J% = 



(l-mf)(l-m|) 
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(10) 



The values of the local fields are then obtained from 



N-p 



* |A"-l|A'<t>? 



Gi(A',AM) 



G 2 (A",A") 
Gi(A",A 



tanh 1 77i, 
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777 i 



(11) 



r M Y v - G ^^ v ) V »M 



(14) 



which has a straightforward mean-field interpretation. 

The above results are reminiscent of PCA, but differ 
in several significative aspects. First, the patterns do 
not coincide with the eigenvectors due to the presence 
of TOj-dependent terms. Secondly, the presence of the 
A M -dependent factor in ([9]) discounts the patterns corre- 
sponding to eigenvalues close to unity. This effect is easy 
to understand in the limit case of independent spins and 
perfect sampling (B — > oo): T is the identity matrix, 
which gives A M = l,V/i, and the patterns rightly vanish. 
Thirdly, and most importantly, not only the largest but 
also the smallest eigenmodes must be taken into account 
to calculate the interactions. 

The couplings J° (fT0)l calculated from the lowest-order 
approximation for the patterns are closely related to the 
mean-field (MF) interactions [(J, 



and [M||] „ is obtained from [M^] 4 . upon substitution 

of A M , A", 7jf , v" with, respectively, A^, A", vf, v". Func- 
tions Gi and G2 are defined through 



Gi(x,y) = (x\y-l\+y\x-l\ 
G 2 (x,y) = \Jxy\x- 1| \y- 1| . 



(15) 



t mf _ 



^(1 



(12) 



The covariance matrix of the fluctuations of the fields is 
given in Section fVDI Error bars on the couplings (|4|) can 
be calculated from the ones on the patterns. 

Formula (fT3| tells us how significative are the inferred 
values of the patterns in the presence of finite sampling. 
For instance, if the error bar ((A£f ) 2 ) 1 / 2 is larger than, or 
comparable with the pattern component calculated 
from (J9J) then this component is statistically compatible 
with zero. According to formula (|13l) we expect error 
bars of the order of ^= over the pattern components, 

where a = jj. 
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FIG. 1: Geometrical representation of identity (I16|l . express- 
ing the rescaled pattern £' as a linear combination of the 
eigenvector v and of the orthogonal fluctuations (3. The most 
likely rescaled pattern, (£')°> corresponds to a = 1 — j,/3 = 

O.The dashed arc has radius ^Jl — j. The subscript fi has 
been dropped to lighten notations. 



through 



(18) 



see Fig. Q] Small values of 9^ correspond to reliable pat- 
terns, while large 9^ indicate that the Maximum Like- 
lihood estimator of the [i th pattern is plagued by noise. 
The value of p such that 9 P is, say, 
mate for the number of attractive patterns. 

The above approach can be easily repeated in the case 
of repulsive patterns. We obtain, with obvious notations, 



about j is our esti- 



<G9 
and 



N-p 



i 



k=p+l 



X k - A' 1 



(19) 



\ 



<0" 



i" 1 



(20) 



D. Optimal numbers of attractive and repulsive 
patterns 



The value of p such that 9? is, say, about ? is our estimate 
for the number of repulsive patterns. 



We now determine the numbers of patterns, p and p, 
based on a simple geometric criterion; the reader is re- 
ferred to Section IV El for detailed calculations. To each 
attractive pattern £ M we associate the rescaled pattern 
(£ M )', whose components are (£f)' = £f \/l — m l I VN. 
We write 



(16) 



where is a positive coefficient, and /3 M is a vector or- 
thogonal to all rescaled patterns by virtue of (JB]) (Fig.QJ. 
Our lowest order formula © for the Maximum Likeli- 
hood estimators gives a M = 1 — and /3 M = 0, see 
Fig. [T] This result is, to some extent, misleading. While 
the most likely value for the vector /3 M is indeed zero, 
its norm is almost surely not vanishing! The statement 
may appear paradoxical but is well-known to hold for 
stochastic variables: while the average or typical value 
of the location of an isotropic random walk vanishes, its 
average squared displacement does not. Here, /3 M repre- 
sents the stochastic difference between the pattern to be 
inferred and the direction of one of the largest eigenvec- 
tors of r. We expect the squared norm (/3 M ) 2 to have 
a non-zero value in the N, B — > oo limit at fixed ratio 
> 0. Its average value can be straightforwardly 



computed from formula (1141) . 



N-p 



M\2\ 



B ^ \^ 

k=p+l 



(17) 



where fi is the index of the pattern. We define the angle 
9** between the eigenvector v^ 1 and the rescaled pattern 



E. Regularization 

So far we have considered that the prior probability Pa 
over the patterns was uniform, and was used to break the 
invariance through the conditions ([6J. The prior proba- 
bility can be used to constrain the amplitude of the pat- 
terns. For instance, we can introduce a Gaussian prior 
on the patterns, 



Po oc exp 



JV 




n (21) 

which penalizes large pattern components [11] . The pres- 



ence of the (1 



m 



! ) factor entails that the effective 



strength of the regularization term, 7(1 — rnf), depends 
on the site magnetization. Regularization is particularly 
useful in the case of severe undersampling. With regular- 
ization (f2"T|) the lowest order expression for the pattern is 
still given by ©, after carrying out the following trans- 
formation on the eigenvalues, 



A M -> 

\ k -> \ k , 

A" -> A> + 7 



(fj, = l,...,p) , 

(k = p + l,...,N -p) , 

(/i=l,...,j5) . (22) 



The values of p and p must be such that the transformed 
\ p and X p are, respectively, larger and smaller than unity. 
Regularization (|2"Tj) ensures that the couplings do not 
blow up, even in the presence of zero eigenvalues in T. 
Applications will be presented in Sections IIIII and IIVI 
The value of the regularization strength 7 can be chosen 
based on a Bayesian criterion (2fj| . 
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F. Maximum likelihood inference: first corrections 



G. Quality of the inference vs. size of the data set 



We now give the expression for the first-order correc- 
tion to the attractive patterns, 



(O 



1\M 



A 



N 



1 — m' 



A k » B k » v k , 



(23) 



k=l 



where 



N 



E+ E |fA"-i) 

K p=l p=N+l-pj 



and 



\»—\ 



if k < p 



(24) 



(25) 



and 



C k = 



i v k 



if k>p+l 



N 



E+ E l(^-i)«) 2 

p=l p=N+l-pi 



(26) 

Similarly, the first corrections to the repulsive patterns 
are 



K ) 



1^ 



iV 



AT 



1 — m; 



E ^ B k ^ v k 



(27) 



fc=i 



f. The 



The accuracy e on the inferred pattern is limited both 
by the sampling error resulting from the finite number 
of data and the intrinsic error due to the expansion ([SJ • 
According to Section Hi CI the sampling error on the pat- 
tern component is expected to decrease as <■ 
intrinsic error depends on the order of the expansion, on 
the size A and on the amplitude of the patterns. 

No inference is possible unless the ratio a — |l exceeds 
a critical value, referred to as a c in the following (Sec- 
tion |VIA~2]). This phenomenon is similar to the retarded 
learning phase transition discovered in the context of un- 
supervised learning [l8j . 

Assume that the pattern components £j are of the or- 
der of one (compared to A), that is, that the couplings 
are almost all non zero and of the order of . Then, the 
intrinsic error is of the order of -A* with the lowest or- 
der formula , and of the order of when corrections 
(|23p are taken into account; for a more precise statement 
see Section IV Al and formula (|4"5|) . The corresponding 
values of B at which saturation takes place are, respec- 
tively, of the order of A 3 and A 5 . The behaviour of 
the relative error between the true and inferred patterns, 
e (|32j) . is summarized in Fig. [2] In general we expect 
that B ~ jv 1+2a samples at least are required to have 
a more accurate inference with a^-order patterns than 
with (a — l) t/l -order patterns. Furthermore there is no 
need to sample more than _/V 3+2a configurations when 
using the a t,l -order expression for the patterns. 

If the system has O(N) non vanishing couplings Jjj of 
the order of J, then patterns have few large components, 
of the order of \fj . In this case the intrisic error over the 
patterns will be of the order of J with the lowest order 
inference formulae, and of the order of J 2 with the first 
corrections. The numbers of sampled configurations, B, 



The definition of A k ^ is identical to with C M and 

v? replaced with, respectively, C ' N + 1 ~f i and v£. Finally, 



i-A> 



if k>N-p+l 



n/Mi-v-) . 



(28) 



if k<N-p 



The first order corrections to the fields hi can be found 
in Section EB 

It is interesting to note that the corrections to the pat- 
tern £ M involve non-linear interactions between the eigen- 
modes of T. Formula (f24|) for A hfl shows that the modes /i 
and k interact through a multi-body overlap with mode 
p (provided A p ^ 1). In addition, A ^ does not a pri- 
ori vanish for k > p + 1: corrections to the patterns 
have non-zero projections over the 'noisy' modes of T. 
In other words, valuable information over the true values 
of the patterns can be extracted from the eigenmodes of 
r associated to bulk eigenvalues. 




FIG. 2: Schematic behaviour of the error e on the inferred 
patterns as a function of the number B of sampled configu- 
rations and for a problem size equal to N, when the pattern 
components are of the order of unity compared to TV. See 
main text for the case of few large pattern components, of 
the order of y/N, i.e. couplings J of the order of 1. 
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required to reach those minimal errors will be, respec- 



0.05 



tively, of the order of -j? 



and jj. 



III. TESTS ON SYNTHETIC DATA 

In this Section we test the formulae of Section Ql] for 
the patterns and fields against synthetic data generated 
from various Ising models with known interactions. We 
consider four models: 

• Model A is a Hopfield model with N = 100 spins, 
p (= 1 or 3) attractive patterns and no repulsive 
pattern (p = 0). The components of the patterns 
are Gaussian random variables with zero mean and 
standard deviation £, specified later. The local 
fields hi are set equal to zero. 

• Model B: Model B consists of N spins, grouped 
into four blocks of ^ spins each. The p — 3 pat- 
terns have uniform components over the blocks: 

e = 2^(0,1,1,1), e = §(V3,i,-2,i), e = 

|(\/3, — 2, 1, 1). The fields are set to zero. Those 
choices ensure that the pattern are orthogonal to 
each other, and have a weak intensity: on average, 

iei 2 -i<i- 

• Model C is a very simple Ising model where all fields 
and couplings vanish, except coupling J12 = J be- 
tween the first two spins. 

• Model D is an Ising model with N = 50 spins, on 
an Erdos-Renyi random graph with average con- 
nectivity (number of neighbors for each spin) equal 
to d = 5 and coupling values J distributed uni- 
formly between -1 and 1. Model D is an instance of 
the Viana-Bray model [2l[ . In the thermodynamic 
limit N — > 00 this model is in the spin glass phase 
since d(tanh 2 (J))j > 1 [2l|. 



For each one of the models above, the magnetizations 
and pairwise correlations can be estimated through the 
sampling of B configurations at equilibrium using Monte 
Carlo simulations. This allows us to estimate the conse- 
quence of sampling noise on the inference quality by vary- 
ing the value of B. Furthermore, for models B and C, it 
is possible to obtained the exact Gibbs values for rrii and 
Cij (corresponding to a perfect sampling, B = oo)[40l|. 
This allows us to study the systematic error resulting 
from formulae (|9l23l27p . irrespectively of the sampling 
noise. 

Model A is used to test the lower order formula for 
the patterns, and how the quality of inference depends 
on the amplitude of the patterns. Models C and D are 
highly diluted networks with strong J = 0(1) interac- 
tions, while models A and B correspond to dense net- 
works with weak J = 0(-A>) couplings. Models C and 
D are, therefore, harder benchmarks for the generalized 
Hopfield model. In addition, the couplings implicitly de- 
fine, through (jj), both attractive and repulsive patterns. 
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FIG. 3: Application of formula |[9j) to two sets of B = 40 
(top) and 400 (bottom) configurations, randomly generated 
from the distribution Ph (0 for model A with p = 1 pattern. 
The standard deviation of the pattern components is £ = 
.7. Left: comparison of the true and inferred couplings for 
each pair Right: comparison of the true and inferred 

components £j of the pattern, with the error bars calculated 
from (|13[) . The dashed lines have slope unity. Inference is 
done with p = 1 attractive pattern and no repulsive pattern. 



Those models can thus be used to determine how much 
repulsive patterns are required for an accurate inference 
of general Ising models. 



A. Dominant order formula for the patterns 

We start with Model A with p = 1 pattern. In this 
case, no ambiguity over the inferred pattern is possi- 
ble since the energy E is not invariant under continu- 
ous transformations, see Section III Al We may therefore 
directly compare the true and the inferred patterns. Fig- 
ures [3] and 2] show the accuracy of the lowest order for- 
mula for the patterns, eqn @. If the pattern components 
are weak, each sampled configuration <r is weakly aligned 
along the pattern If the number B of sampled config- 
urations is small, the largest eigenvector of Y is uncorre- 
cted with the pattern direction (Fig. [3]). When the size 
of the data set is sufficiently large, i.e. B > ct c N {Sec- 
tion I VI A~2l . formula © captures the right direction of 
the pattern, and the inferred couplings are representative 
of the true interactions. Conversely, if the amplitudes of 
the components of the pattern £ are strong enough, each 
sampled configuration er is likely to be aligned along the 
pattern. A small number B (compared to TV) of those 
configurations suffice to determine the pattern (Fig. @| . 
In the latter case, we see that the largest components & 
are systematically underestimated. A systematic study 
of how large B should be for the inference to be reliable 
can be found in Section IVT1 
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FIG. 4: Same as Fig. [3] but with a standard deviation £ = 
1.3 instead of £ = .7. The amplitude is strong enough to 
magnetize the configurations along the pattern, see Sections 
IVIAll and lvrB3l 
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FIG. 5: Relative differences between the true and the inferred 
couplings, AJ a b/Jab for three system sizes, N. The inference 
was done using the lowest order ML formulae ((9]l for the pat- 
terns. Data were generated from Model B (perfect sampling); 
there are a prion ten distinct values of the couplings, one for 
each pair of blocks a and b. Inset: average value of AJ a t,/J a b 
as a function of -k. Circles, squares and diamonds correspond 
to, respectively, N = 52, 100 and 200 spins. 



We now use model B to generate the data. As model 
B includes more than one pattern, the inferred patterns 
cannot be compared to the true one easily due to the in- 
variance of Section III Al We therefore compare in Fig. [5] 
the true couplings and the interactions found using © 
for three sizes, N = 52, 100 and 200. The size N sets 
also the amplitude of the couplings, which decreases as 
jr from As the patterns are uniform among each 
one of the four blocks there are ten possibles values for 
the couplings Jij , depending on the labels a and b of the 
blocks to which i and j belong, with 1 < a < b < 4. 
For N — 100 spins, the relative errors range between 3 
and 5.5%. When the number of spins is doubled (respec- 
tively, halved) the relative errors are about twice smaller 
(respectively, larger). This result confirms that formula 
([5]) is exact in the infinite N limit only, and that correc- 
tions of the order of O(j^) are expected for finite system 
sizes (Inset of Fig. [5]). This scaling was expected from 
Section HTG] 

We now consider model C. For perfect sampling (B = 
oo) the correlation matrix ([1} is 



I 1 tanh J ... \ 

tanhJ 1 Q ... 

1 ... 

... 1 

V ... 1/ 



(29) 



The top eigenvalue, A 1 
eigenvalue, A 1 = = 



= 1+tanh J > 1, and the smallest 
1 — tanh J < 1, are attached to 



the eigenvectors 



1 

71 



1 




v 1 



1 

71 



-i 
o 

V oJ 



(30) 



The remaining N — 2 eigenvalues are equal to 1. Using 
formula (fTUf for the lowest order coupling, J°, we find 
that those eigenmodes do not contribute and that the in- 
teraction can take three values, depending on the choices 
for p and p: 



(J )p=l,p=0 
(J )p=0,p=l 
(J )p=l,£=l 



tanh J J J 2 

2 (1 + tanh J) ~ 2 ~2 

tanh J J J 2 

2(1 - tanh J) ~ 2 + T 

tanh J T 2 J 3 



J 3 . 
3 

J 3 
3 



1 - tanli J 



J 



(31) 



Those expressions are plotted in Fig. [6l The coupling 
(J°)i,o (dashed line), corresponding to the standard Hop- 
field model, saturates at the value \ and does not di- 
verge with J. Even the small J behavior, (J°)i,o — 
is erroneous. Adding the repulsive pattern leads to a 
visible improvement, as fluctuations of the spin config- 
urations along the eigenvector v 1 (one spin up and the 
other down) are penalized. The inferred coupling, ( J°)i,i 
(bold line), is now correct for small J, (J°)i,i — J, and 
diverges for large values of J. 
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B. Error bars and criterion for p,p 
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FIG. 6: Inferred coupling J° between the first two spins of 
Model C, within lowest order ML, and as a function of the 
true coupling J. Values of p and p are shown in the Figure. 
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An illustration of formula ([TUf for the error bars is 
shown in Fig. [31 where we compare the components of 
the true pattern used to generate data in Model A with 
the inferred one, and the error bar, \J ((A^) 2 ). For 

small a = Mt the inferred pattern components are uncor- 
rected with the true pattern and compatible with zero 
within the error bars. For larger values of a, the discrep- 
ancy between the inferred and the true components are 
stochastic quantities of the order of the calculated error 
bars. 

We report in Fig. [8] the tests of the criterion for de- 
termining p and p on artificially generated data from an 
extension of model A with p = 3 patterns. For very poor 
sampling (Fig. [8l top) the angle 8 1 is close to \: even 
the first pattern cannot be inferred correctly. This pre- 
diction is confirmed by the very poor comparison of the 
true interactions and the inferred couplings calculated 
from the first inferred pattern. For moderately accurate 
sampling (Fig. [SJ middle) the strongest pattern can be 
inferred; the accuracy on the inferred couplings worsens 
when the second pattern is added. Excellent sampling 
allows for a good inference of the structure of the under- 
lying model: the angle M is small for /j, — 1,2,3 (Fig. HI 
bottom), and larger than ~ for fi > 4 (not shown). Not 
surprisingly large couplings are systematically affected 
by errors. Those errors can be corrected by taking into 
account 0(-^=) corrections to the patterns if the number 

of data, B , is large enough (Section IVI[) . 

Figure |H] compares the inferred and true couplings for 
B = 4500 sampled configurations of Model D. The opti- 
mal number of patterns given by the geometrical criterion 
is [p = 9,p = 35), see Fig. [7J Hence most of the compo- 
nents of r are retained and the interactions inferred with 
the generalized Hopfield model do not differ much from 
the MF couplings. 



FIG. 7: Inferred vs. true couplings for Model D, with 
B — 4500 sampled configurations. Left: Hopfield model with 
p — 9 (corresponding to the optimal number of patterns se- 
lected by the geometrical criterion); no repulsive pattern is 
considered (p — 0). Right: Generalized Hopfield model with 
(p,p) = (9, 35) (optimal numbers). 



C. Corrections to the patterns 



We now turn to Model D. Figure [7j compares the in- 
ferred and true couplings for B = 4500 sampled config- 
urations. The generalized Hopfield model outperforms 
the standard Hopfield model (p = 0), showing the im- 
portance of repulsive patterns in the inference of sparse 
networks with strong interactions. Large couplings, ei- 
ther positive or negative, are overestimated by the lowest 
order ML estimators for the patterns. 



Formula (|23|) for the corrections to the patterns was 
tested on model B in the case of perfect sampling. Re- 
sults are reported in Fig. [TU] and show that the errors in 
the inferred couplings are much smaller than in Fig. [S] 
Inset of Fig. [10] shows that the relative errors are of the 
order of ™ only. This scaling was expected from Sec- 
tionlnni Pushing our expansion of £ to the next order in 
powers of jj could in principle give explicit expressions 
for those corrections. We have also tested our higher or- 
der formula when the fields hi are non-zero. For instance 
we have considered the same Hopfield model with p = 3 
patterns as above, and with block pseudo-magnetizations 
t = yjr(2-v/3, 2, 2, — 4). Hence, t was orthogonal to the 
patterns, and the field components were simply given by 
hi = tanh -1 t i: according to (13"51) 41]. For N — 52 spins 
the relative error over the pseudo-magnetizations (aver- 
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FIG. 8: Criterion to decide the number p of patterns and 
performance of the ML inference procedure for three different 
sizes of the data set, B. Left: inferred vs. true interactions 
with p = 1,2 or 3 patterns; the dashed line has slope unity. 
Right: coefficients a M = (p M ) 2 and b' 1 = <(/3 M ) 2 ) vs. pattern 
index n, and angles # M , divided by -|, see definitions (fl6|) 
and (|18p . For each value of B one data set was generated 
from Model A with p — 3 patterns, and standard deviations 
£ J = .95, C 2 = -83, and £ 3 = .77. 
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FIG. 9: Inferred vs. true couplings for Model D, with 
B = 4500 sampled configurations. Left: Generalized Hop- 
field model with (p,p) = (9,10) and (11,39) (corresponding 
to the numbers of eigenvalues, respectively, larger and smaller 
than unity). Right: angles # M and 9^ for, respectively, at- 
tractive (triangle) and repulsive (diamond) patterns. 



o o N = 


52 


■ ■ N = 


100 


o o N = 


200 



< 0.004 
0.002 



> 



0, 



0.008 



x> 0.006 

P3 



0.0002 0.0004 



0.004^ 



^ 0.002 
0" 



o 



1/N 



1 



block (a,b) 



10 



FIG. 10: Relative differences between the true and the in- 
ferred couplings, A Jab/ Jab as a function of the system size, 
N. The inference was done using the finite- TV ML formu- 
lae ([9]) and (|23l) for the patterns. Data were generated from 
a perfect sampling of the equilibrium distribution of a Hop- 
field model with p = 3 patterns and four blocks of £ spins, 
see main text; a and b are the block indices. Inset: average 
value of ^.Jab/Jab as a function of j^. Circles, squares and 
diamonds correspond to, respectively, N = 52, 100 and 200 
spins. 



aged over the four blocks a) was ~ .0301 with the 
large-TV formula ® and ^ ~ 0.0029 with the finite- TV 
formulae (JUJ and $H§. 

Corrections to the PCA were also tested when data are 
corrupted by sampling noise. We compare in Fig. [11] the 
components of the pattern of Model A found with the 
lowest order approximation ^ and with our first order 
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FIG. 11: True vs. inferred components of the patterns, £j, for 
the model with N = 100 spins described in Fig. [4] Full circles 
are the result of the lowest order inference formula ((9]l , while 
empty circles show the outcome of the first order formulae 
J23t. 



formulae ([23"]) (case of strong pattern). A clear improve- 
ment in the quality of the inference is observed, even 
when the sampling noise is strong. Our second example 
is Model B. We show in Fig. [12] the relative errors 



N(N - 1) 



AJi, 



J 4.' 



(32) 



between the true and the inferred couplings, with for- 
mulas ([9]) and ([23]) . as a function of the number of sam- 
pled configurations, B, and for N = 52 spins. As B 
increases the relative error with the lowest order pat- 
terns (PCA) first decreases as B^ 1 / 2 , then saturates to 
the value ~ .0794, as expected from Fig. [5] The relative 
error with the correction to the patterns also decreases 
as B~ x / 2 , and is expected to saturate to the lower value 
~ .00374 (Fig. [TO]) . We remark that the gain in accu- 
racy over the inferred couplings resulting from the cor- 
rections (|2"3")l to the patterns is obtained only when B is 
very large. B ~ N 3 configurations at least should be 
sampled to obtain an improvement over the lowest order 
formula 0. This scaling holds when the couplings are 
weak, and decrease as jj. If the interaction network is 
diluted and carries couplings J — 0(1), we expect that 
B ~ N/J 2 configurations have to be sampled to make 
the first-corrections to the patterns effective. 

We have applied our formula (|23|) to calculate the first 
correction to the couplings (|3"Tj) for Models C and D. 
As for Model C, we find that the correction to the cou- 
pling (J°)i,i vanishes; this result is due to the fact that 
(J°)i,i is already correct to the second order in J, and 
that higher order corrections would be needed. The cor- 
rections to the coupling (J°)i,o are equal to 



(J 1 ) 



tanh J tanh J(l + tanh J) 



2V2 
1 

16 + 



2V2 



J ~t~ ~T~T 



16 
16 



(33) 



The resulting coupling, (J° + J 1 ) 10, is plotted as a func- 
tion of J in Fig. [BJ and qualitatively improves over the 
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FIG. 12: Relative error between the inferred and true cou- 
plings for Model B (with N — 52 spins) vs. number of sam- 
pled configurations, B. The two curves correspond to the 
inference done with the th order formula © (black circles) 
and the 1 st order formula (I23[) (squares). Each data point 
is the average over 10 samples; relative error bars are about 
1%, and are much smaller than the symbol size. The asymp- 
totic value of the errors, corresponding to perfect sampling 
(B — 00), are extracted from Figs. [51 and 1 101 



lowest order result (|3"Tj) . In particular, for small J, the 
inferred coupling is now (J° + J 1 )io ~ .916 J — .438 J 2 , 
which is definitely closer to J than (|3"Tj) . In the case of 
Model D, the first-order corrections improve only slightly 
the estimates for the large couplings. 



IV. APPLICATION TO BIOLOGICAL DATA 

In this Section we show how the inference approach 
can be applied to real biological data, and compared to 
other Boltzmann Machine learning procedures. 



A. Cortical activity of the rat 

We have first analyzed data coming from the record- 
ing of 37 neurons in the prefrontal cortex of rats. The 
experiment, done by A. Peyrache, F. Battaglia and their 
collaborators, consists in recording the neural activity 
during a task and during the Slow Wave sleep preceding 
and following the learning of the task [22]. PCA allowed 
Peyrache et al. to identify patterns in the activity, which 
are generated when the rat learns a task and are replayed 
during the sleep (22|. 

We have analyzed with the generalized Hopfield model 
the data corresponding to a 20 minute-long recording of 
the activity of a rat during the task (data shown in Fig. 
1 of |12|). The raster plot was binned with a 10 msec 
window to obtain binary configurations of the neurons 
(active or silent in the time-bin). We have then calcu- 
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FIG. 13: Couplings calculated with the generalized Hopfield 
model vs. couplings calculated with the adaptive cluster ex- 
pansion of [Tl(] for 37 cells recorded in the prefrontal cortex 
of a behaving rat. Top: Hopfield model with p = 4, 8 (corre- 
sponding to the optimal number of patterns selected by the 
geometrical criterion) and 17; no repulsive pattern is consid- 
ered (p — 0). Bottom. Generalized Hopfield model with 
(p,p) = (4,4), (8,8) (optimal numbers) and (17,20) (corre- 
sponding to the numbers of eigenvalues, respectively, larger 
and smaller than unity). 



FIG. 14: Couplings calculated with the Generalized Hopfield 
model versus coupling calculated with the adaptive cluster 
expansion for 92 amino-acids in the PDZ domain. The values 
of p, p are given in the Figure. Note that p = for the top 
panels. The middle panels correspond to the optimal values 
for the number of patterns. 



B. Protein-domain families 



lated the average frequencies, m*, and the pairwise corre- 
lations, Cij. We calculate the couplings with p attractive 
and p repulsive patterns according to © and ([TU|) . The 
numbers p and p are calculated according to the geomet- 
rical criteria (fT5]l and ([20)1 . Hereafter, we compare the 
couplings obtained this way to the ones found with the 
adaptive cluster expansion (ACE) of [TTJ] , which is not 
based on the expansion of the loglikclihood used in the 
present work. 

In Fig. [13] (top) we compare the Hopfield (p = 0) cou- 
plings with p = 4,8, 17 selected patterns to the ACE cou- 
plings. The agreement is quite good for p* = 8. In [22j 
p = 6 patterns were kept in the PCA; this value is close 
to the optimal value, p = 8, we find using the geometri- 
cal criterion. Addition of repulsive patterns (bottom of 
Fig. [Ill)) slightly improves the similarity with the ACE 
couplings. We find, indeed, that the couplings Jy are 
rather weak, and that repulsive patterns do not play an 
important role. Calculating the couplings with all eigen- 
modes (p = 17, p = 20) is equivalent ot the mean- field 
(MF) approximation. A clear discrepancy between the 
Hopfield and the ACE couplings is found for the largest 
(in absolute value) interactions. We have checked that 
this discrepancy is not reduced when the first order cor- 
rections to the patterns are included, presumably because 
the number of data is not sufficient. Couplings are not 
significatively changed in the presence of the regulariza- 
tion (f2~lj) for sensible values of 7. 



We have next analyzed the alignement of a family of 
240 sequences of PDZ, a commonly encountered domain 
binding the C-terminus of proteins, with 92 amino-acids 
(2~il ]. R. Ranganathan and collaborators have elaborated 
an approach, called Statistical Coupling Analysis(SCA), 
to extract interactions between residues by using evolu- 
tionary data for the protein, i.e. by sampling the single- 
site and pairwise frequencies from multi-sequence align- 
ments of the family 23] . Briefly speaking, SCA consists 
in doing a PCA analysis of a weighted correlation ma- 
trix, DiTijDj, where the weig ht Di on site i is small for 
poorly conserved residues [24 1 . 

We have taken the binary data representation of the 
240 PDZ sequences in the alignement given in (2f| (Sup- 
plementary Material). This consensus approximation 
amounts to replace the amino-acid on each site (20 pos- 
sible types) with a binary variable of, equal to +1 if the 
amino-acid i in the b th sequence is the most common 
amino-acid at that position in the alignment, to —1 oth- 
erwise. The consensus representation docs not allow to 
keep track of all the information contained in the align- 
ment but is indicative of the conservation pattern in the 
family. 

The inferred couplings, denoted by J 92 , are shown in 
Fig. [TU As in the case of Model D in Section HTT1 we find 
that proteomic data are better accounted with the gen- 
eralized Hopfield model than with the standard Hopfield 
model: repulsive patterns seems necessary to recover the 
couplings found with the ACE method. The couplings 
found with attractive patterns only are not correlated 
with the ACE couplings (top of Fig. IT4l , while the agree- 
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FIG. 15: Same as Fig. 1 141 when retaining the 44 residues with 
the largest weights Di only [24|] . The values of p, ft are given in 
the Figure. Note that ft — for the top panels. The middle 
panels correspond to the optimal values for the number of 
patterns. 




FIG. 16: Comparison between the couplings Jij calculated 
with all 92 residues and with the 44 most weighted residues 
only, for each one of the 44 x 43/2 pairs (i, j) of residues. 



V. EXPANSION OF THE CROSS ENTROPY 
AND MAXIMUM LIKELIHOOD INFERENCE 



merit is quite good when taking into account attractive 
and repulsive patterns; the optimal numbers of patterns 
are p — 4 and p = 10. 

We have also calculated the couplings when discard- 
ing all but the most weighted sites. More precisely, we 
have recalculated the distribution of the weights Di as 
in [ii], [25[ , and found a bimodal distribution, which sug- 
gests a natural cut-off between large and small weights. 
We have redone the previous inference when keeping only 
the 44 residues (out of 92) with the lar gest weights, corre- 
sponding to the red sites in Fig. C of [2J|. The resulting 
interactions, denoted by J 44 , are shown in Fig.[15l Again 
we compare the couplings found with the Hopfield model 
and with the ACE. The agreement is not good with at- 
tractive patterns only (as done in usual PCA), and is 
very good when repulsive patterns are included. 

An interesting question is whether the couplings ob- 
tained between the 44 most conserved residues are 
strongly affected by the presence or the absence of the 
remaining 48 residues in the inference. The interactions 
in the 44-site model are effective and a priori differ from 
their values in the 92-site model, in that they account 
for chains of interactions going through the remaining 
48 sites. Nevertheless, we find that the couplings cal- 
culated with all 92 residues and the couplings obtained 
from the subset of 44 sites with large weights are simi- 
lar, see Figim This result suggests that the 48 residues 
removed from our second analysis are not strongly inter- 
acting with the 44 retained sites. 



This Section is intended to provide the derivations of 
the results announced in Section [TTJ Maximizing the pos- 
terior probability ([5]) with respect to the patterns and the 
fields is equivalent to minimizing the cross entropy of the 
Hopfield model given the data, 

$[h, {a b }} = log Z[h, {£"}] + U[h, {£»}, {<r b }} , 
where Z is the partition function appearing in ([2]), 

Z[h, {£"}] = £exp ( - E[cr, h, {£"}]) , (35) 

(7 

and U is the average value of the energy E over the 
sampled configurations: 

N 1 

U[h, {£»}, {<r b }} = - ]T himi - - £ J tj c l3 , (36) 

where the couplings Jij are calculated from the patterns 
according to (@|. The calculation of the partition func- 
tion, which is defined as a sum over 2 N configurations, 
cannot generally be done in a reasonable time for large 
sizes N. In the next section we show how the use of 
statistical mechanics techniques allows one to obtain a 
systematic expansion of Z, and, thus, of the cross en- 
tropy 

$ = $° + $! + ... , (37) 
in powers on -4= and -4=. 
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A. Expansion of the free energy of the Hopfield 
model in powers of -^=, -^j= 



To lighten notations calculations are presented for the 
case of attractive patterns only. We explain at the end 
of the Section how formulae are modified in the presence 
of repulsive patterns. 

For technical reasons to be made clear below it results 
convenient to make the change of variables h — > t de- 
scribed by 



(38) 



where the ti, hereafter called pseudo-magnetizations, are 
real- valued numbers comprised between — 1 and 1 . Here- 
after, we will infer the most likely values for t, and will 
recover the fields h through (|38[) . The change h —> t 
amounts to consider the energy function 



/v 



E = — <7j tanh 1 



1 



N 



(l=l \i=l 



(39) 

instead of the original expression for E ([3]) (with p = 
0). Obviously, when the identities (|3"8")l are fulfilled, both 
energies are equal (up to a er-independent additive term) 
and define the same likelihood function @. 

We unravel the squared terms in the partition function 
(1351) through a set of p auxiliary Gaussian variables x = 
(x , . . . , x p ), and carry out the summation over the spin 
configurations. We obtain 



Z = 



T-r dx>* 

LL~k= ex P 



/2vr 



N 



log 2 cosh I tanh 1 U + 



N 



(40) 



If N is large enough the dominant contribution to the 
integral will come from x*. the value of x maximizing 
the argument of the exponential above. We obtain the 
following saddle point equation for x, 



where 



tanh _1 ti v " ' " 



N 



(41) 



(42) 



We then write x M = {x^)*+y^ and expand the hyperbolic 
cosine function in powers of y M . The change of variable 
(|38| is such that the linear term in y 11 in the expansion 
of the hyperbolic cosine function cancels out with the 

— ' - , indepen- 

i,n 

dently of the value of (x M )*. Expanding the hyperbolic 



cosine up to the second order in y^ we find our lowest 
order approximation to the partition function, 



Z° = e 1 



IT ^ 
II - ^ ex P 



'2-k 



(43) 



+ 



^EE^w(i-^ 2 



27V 



%/det A 



where F* is the the argument of the exponential in (|40)) 
calculated in x 11 * , 

F* = N] g2 + l^]og(l-Tf)-^^{T i T j -t i t j ) , 

i fj.,ij 

(44) 

and A is the p x p matrix with entries, 

i 

We then compute the average energy U (l36l) , 
U = — mi tanh -1 U 

i 

7YE ~ '"' ' • - + '' 



(45) 



(46) 



Our lowest order approximation for the cross entropy is, 
according to (|3i)). and (|46| : 



$° = -^™ i tanh" 1 ^ i + iVlog2-|-i^log(l-7 1 f 

4—1 i 

' Tn E $ ("« ~ mi m i lo S det A 



+ 2ivE 



E^(^ 



(47) 



The first order contribution to the cross entropy, in 
(|37[) . is obtained by retaining the fourth order in y^ in 
the expansion of the hyperbolic cosine function in (|40[) , 



(48) 

We expect the differences $ — $° and <f> — ($° + 
between, respectively, the true and the lowest order cross 
entropies and the true and the first order cross entropies 
to be of the order of, respectively, R 2 and R 3 , where 



i?=|^ 2 (l-m 2 )A 



(49) 



Here, £ 2 is the order of magnitude of the pattern com- 
ponents, which can range from 1 if the patterns are ex- 
tended over the whole system to ~ \/~N for highly sparse 
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patterns, m is the typical value of the local magnetiza- 
tion, and A is the order of magnitude of the eigenvalues of 
A -1 , which can range from 1 to N. The value of R fixes 
the instrinsic error e on the inferred patterns discussed 
in Section [II G[ e ~ R for the lowest order approximation 
and e ~ R 2 with the first order corrections. 

The above calculation can be straightforwardly ex- 
tended to the case of the generalized Hopfield model 
by considering the p repulsive patterns as patterns with 
purely imaginary components, ^ = i£ , with i 2 = — 1. 
For instance the general lowest order expression for the 
cross entropy is 



i—1 i 



-E 

1 t 

— Y 

2N ^ 



i 



1 A iA 

2 l0Sdet [-iAT A 



(50) 



where 



^ = tanh tanh~ 1 U + V S±JjL _ y 

\ jLl=l v /i=l 



i 



(51) 



The first order correction (|4"5)l can be easily written for 
the case of repulsive patterns, too. 



B. Are the physical properties of the system 
relevant for the inference? 

The Hopfield model was first introduced as a model 
for which a set of p desired ground states £ M (or fixed 
points of the zero temperature Glauber dynamics) could 
be programmed through an adequate choice of the in- 
teractions. Each fixed point has a basin of attraction in 
the configuration space, corresponding to a phase of the 
system. The order parameters are the overlaps 



(52) 



which quantify how much the configurations are on av- 
erage aligned along each pattern. The amplitudes and 
directions of the pattern and the field vectors determine 
if spin configurations tend to be aligned along the field, 
or along one or more patterns. In the infinite size limit 
(N — > oo) the overlaps are the roots of p coupled and 
self-consistent equations, 



1 



= lim — 



N- 



N ^ 



& tanh (hi 



yen 



(53) 



Using (|3"51) and the saddle point equation (|4"T) it is easy 
to check that the overlaps 



4^^ 



(54) 



are solutions to the set of equations (|53p . Solutions are in 
one-to-one correspondance with the saddle points (x^)* . 

The saddle-point solution x* = corresponds to Ti = 
ti. The average interaction term in the energy function 
Q39P vanishes, meaning that configurations tend to be 
mainly determined by the fields. Such a behaviour cor- 
responds to the paramagnetic phase. The solution x = 
is locally stable if the eigenvalues of the matrix A are 
all positive and, thus, if the patterns are weak enough. 
Solutions with x* ^ correspond to stronger patterns 
and interaction terms in Q39p having non zero values on 
average: they correspond to magnetized phases. 

The cross entropy $ depends on the solution x* 
through the variables Ti only. Once the Tj's and the pat- 
terns £ M 's are inferred, it is easy to calculate the value of 
the fields hi based on equations (j38j) . (|4T|) and (|42l . One 
finds that hi is given by (|38| where ti is substituted with 
Ti. Hence, the inferred parameters do not explicitely de- 
pend on the value of x* . The procedure followed to infer 
the patterns and the fields is not affected by the phys- 
ical phase (paramagnetic or magnetized) of the system, 
though the values of the data wii and Cij obviously de- 
pend on those physical properties. 

It may accidentally happen that equations (|4Tj) have 
different solutions with equal or almost equal contribu- 
tions to the partition function Z. The most natural il- 
lustration is the case of zero field (ti = 0) and one strong 
pattern, where two ferromagnetic states with opposite 
overlaps, (x 1 )* and — (x 1 )* , coexist. In this latter case 
both states give equal contributions to the partition func- 
tion. 



C. Maximum Likelihood inference: lowest order 

We first infer the patterns and the pseudo- 
magnetizations from $°. Minimization of $° (|47p over T 
immediately shows that, up to O(R) corrections, pseudo- 
and true magnetizations coincide: 



(Ti] 



(55) 



16 



Without loss of generality we may write the patterns to 
infer as 



0\M 



(f 



V 1 - m i 
V 1 - m f 



(56) 



where V"' 1 , va M are real- valued coefficients, and and 
v M are eigenvectors of T. According to identity ([5"5j) the 
conditions © are fulfilled in the large N limit if the (p+p) 
vectors (3^ and are orthogonal to each other, and to 
all the patterns (1°)" and . The matrices A (g5J| and 

A (|5"T|) are then diagonal, while A vanishes. We rewrite 
the cross entropy (j5TJ)l as 



EE 



i cr=±l 



1 



<T mi 



loe 



1 + (7 TO, 



2^ 



2^ 



A" a" 



A M 



^E lo § 



r (r) /3 M 

•> 

E(^) 
E(^) 



(57) 

where r( r ) is the restriction of T to the (N — p — p)- 
dimensional subspace orthogonal to the p largest and p 
smallest eigenvectors: 



N-p 

v " \ k vM . 



k=p+l 



(58) 



Minimizing $° over the coefficients a M and the vectors 
/3 P gives the coupled set of equations 



A M = 



1 



En;^ 



A" 



1 - a* 1 - b» ' 



(59) 
(60) 



where & M = (/S^ 1 ) 2 is the squared norm of /3 M . If the vector 
/3 M were non zero, it would be an eigenvector of T with 
eigenvalue A M according to (j6"0"|) . This cannot be true as 
the largest eigenvalue of r( r ) is smaller than X p . Hence, 
0" = 6^ = 0. From ([521) we obtain 



1 



1 

A7 



(61) 



We conclude that the maximum likelihood values for the 
p attractive patterns are given by (jH]) . The minimization 



of over the coefficients a p and the vectors (3 can be 
done along the same lines. We find 



A^ 



(62) 



and /3 M = 0. The maximum likelihood estimators for the 
p repulsive patterns are given by (|9]) again. Once the 
patterns are computed the values of the local fields hi 
are obtained from (fTTj) . 

Notice that vff , £f are typically of the order of N~ a , 
which entails that the components of the patterns are of 
the order of unity. Though keeping each of the or- 
der of unity is a natural scaling in the infinite size limit 
N — >• co, other scalings are possible. Consider a pair of 
strongly coupled spins, i.e. such that the correlation T^- 
is sizeably larger than -A. . According to expression (@} for 
the coupling J^- induced by the patterns between spins i 
and j, we expect the pattern components to be of the or- 
der of v^V- There is thus no compelling reason to assume 
that is vanishingly small for all components i. 

To end with we compute the decrease in cross entropy 
when adding a pattern attached to the eigenvalue A (= A M 
or A M ). Inserting expressions (|61I62I) for a M , a M in (|57[) we 
obtain 



A$ = --(A-l-logA) , 



(63) 



a quantity which is strictly negative for A ^ 1. Not 
surprisingly, adding more parameters to the model allows 
for a better fit of the data. We will see in Section IV El 
how the values of p and p can be determined. 



D. Error bars on the patterns and fields 

When the sample size B is large the posterior distri- 
bution P tends to a Gaussian law centered in the most 
likely values for the patterns, {£ }, and the pseudo- 

magnetizations, T. For the sake of simplicty we consider 
below the case of attractive patterns only; repulsive pat- 
terns can formally be seen as purely imaginary attractive 
patterns, sec Section IV Al Let H denote the Hessian ma- 
trix of <!> . We find, to the leading orders, 



(H*% 



8T}dTj 



1 - mi 



(H«C 



5» v 



(A 



vdiTtij — Cij + (1 — mf) 



V p 
A M A' y 



N 2 

a 2 $° 



(l-m 2 )(l-m 2 )(e°)r(e )^, (64) 



13 - dT t d{(?)) 



. 



(65) 



17 



Here, S denotes the Kronecker function and the expres- 
sion of the lowest order coupling matrix, J , is given 
in (|10j) . The sum over p runs over all pattern indices. 
The cross second derivative, H'^, of the order of jj-, is 
much smaller than the expected order, -^=, and can be 
neglected. 

The covariance matrix of the Gaussian posterior proba- 
bility P is the inverse matrix of B H. The inverse is prop- 
erly defined in the subspace of dimension N(p + p + 1) — 
\{p+p){p+p—l), orthogonal to the modes generating the 
invariance over the patterns, see Section III Al We write 
H = DUD, where D is a diagonal matrix with elements: 

Di = \l 1 — to? in the T-sector, and Df = * / . N .> in the 

£ M -sector. Matrix H has a particularly simple expression 
in the eigenbasis of the correlation matrix T, and can be 
diagonalized exactly after some simple algebra. We ob- 
tain the following expression for the covariance matrix of 
the fluctuations: 



(AT, AT,) = 
where 



y'OL-mfXl-m!) 
B 



(66) 



P =i 



The expressions for the fluctuations of the pattern com- 
ponents are reported in (fT3"|) . Note that the cross-term 
(ATj A£J) vanishes at the expected order of and is 
actually of the order of -g only. Using formula (j3"8"j) we 
find that the error over the fields hi is of the order of -7= , 

where a — j^. 



E. Optimal number of patterns 

So far we have assumed that the number of patterns, 
was known. In practice p is often determined based 
on simple criteria, such as how many eigenvalues 'come 
out' from the spectrum of the correlation matrix (Sec- 
tion [VTE2]) • Alternative approaches exist, e.g. Bayesian 
Information Criterion (BIC) [Hj]. In the BIC the de- 
crease i?A$ (|63|) in cross entropy obtained with a new 
pattern is added a 'cost' N\ogB, equal to the number 
of new parameters times the logarithm of the number of 
data. As the index p, increases the selected eigenvalue A M 
or A p gets closer to one; 73|A$| decreases in abso- 
lute value, and, eventually, is counterbalanced by the cost 
term N\ogB. The value of p, for which the two terms 
balance each other depends on the size of the data set: 
the higher B, the more significative are the correlations 
and the more patterns we need to represent the interac- 
tions. However BIC is mathematically justified when B 
is large compared to N, which is not always the case in 
real data sets. 



Hereafter, we propose a different approach based on 
Bayesian and geometric considerations. Based on the dis- 
cussion in Section III Dl we expect the squared norm of 
the transerve fluctuations (3^ to be non vanishing in the 
B, N ~ > 00 limits. Let us call the squared projection of 
the /I th rescaled pattern onto v M (fTB| . The same quanti- 
ties, a v and b , can be defined for repulsive patterns. We 
define the marginal probability Pm of the squared pro- 
jections a M , a v and of the squared norms 6^, b through 



P 



M 



n 



n 



n 



11 Jl-mj 11 niaN/2 

V.t V l n 1 



n 



exp 



n 



x exp 



niaN/2 



Nb») 



Nb u ) 



(68) 



x P 



*• * ' /i 2 ' /i 2 J 

v 1 - m t v 1 - m i 

where P is the posterior probability ([S]), and the sums 
over n and v run from 1 to, respectively, p and p. After 
carrying out the integrals over the fluctuations /3 M and 
(3 we obtain 



P 



M 



1 

= Z~i 
x exp 



(69) 



B x 



2 ^A^^)-^^^) 



where Z\ is a normalization constant and 
A$ M (^) = \» at t + fV l V i + log (l-o"-^) (70) 



loe;/V\ 



-^logdet [fi^l-r«]+0(^| 



1 



logdet [fT 1 



log(l 

■r«l 4 



o 



' + b") (71) 
log TV, 



B J v N ' 1 

Here 1 denotes the A^-dimensional identity matrix. When 
B is large the integrals in (|6"9")l are dominated by the 
contributions coming from the vicinity of the roots of 



<9A$ 



M 



<9A$ 



M 







(72) 



Maximimization of A$m with respect to the a^jb^'s 
gives equations (|59|) and 

ft" = A" , (73) 

for each /j, = l,...,p. We then compute the squared 
norm b^ from the extremization condition (|T2|) and ob- 
tain 

1 1 

'^Ef?^ < 74 > 

a" = 1 - -L - ^ . (75) 
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Repeating the same procedure to maximize A$ m gives 



1 N ~ p 1 
B ^ A fe -A" 

k— p+1 

a v = J- - 1 - 6" . 
A" 



(76) 



The difference between expressions (|61j) and (|75|) for 
the coefficients a M must be emphasized. P defined in 
([5]) is a probability density over pN pattern compo- 
nents, once the pseudo-magnetizations have been in- 
ferred. Maximization of P, or, equivalently, of $ over 
this large-dimensional space gives expression (|61l) for the 
projection a M of the pattern £ M onto the n th largest eigen- 
vector of r, v p . Instead of directly maximizing P, we 
may first integrate out the orthogonal fluctuations to v M 
in P, and obtain the marginal probability density Pm 
for 2p parameters only, namely the squared projections 
on the eigenvectors, a^, and the squared norms of the 
orthogonal fluctuations, b^. Maximizing the marginal 
probability density Pm or, equivalently, minimizing 
shows that b^ (|75l) does not vanish, and that the value of 
the squared projection a M (|T5[) is smaller than (|6"Tj) . Fig- 
ure Q] sketches the geometrical meaning of the coefficient 
y/aP and the fluctuations /3 M , see (fT6|) . Small values of 
the angle 8^ are expected for reliable patterns. A similar 
picture can be drawn for repulsive patterns. We will see 
how expression (|75[) for the squared norm 6^ naturally 
arises in the context of random matrix theory. 



F. Maximum likelihood inference: first corrections 

We now look for the corrections to the lowest order ex- 
pressions of the patterns and the fields (|9|55[) , encoded in 
expressions © and Ti = Tf + T} . The first order contri- 
bution to the cross entropy, $ 1 , can be seen as a pertur- 
bation to the lowest order cross entropy, <£>°, according 
to (1571) . Within linear response theory this perturbation 
will shift the maximum likelihood estimators by 



T 1 

{(en 



(H)- 



d® 1 
dT 

V {— i 



(77) 



where the inverse of the Hessian matrix of <fr°, H, was 
given in Section IV Dl The calculation of the gradient of 
$ 1 does not present any particular difficulty. The result- 
ing corrections to the patterns are given in eqn (|23[) . The 
expression for the shift in the pseudo-magnetization is 



rrii [v. 



ft\2 



(78) 



(A M 1) 



C 



I - mi 



where C k is given in (|26|) . Notice that, if the magne- 
tizations m, vanish, so do the dominant and first-order 
contributions to the pseudo-magnetizations. 

VI. RELIABILITY OF THE INFERENCE 

An important issue is to determine how many config- 
urations should be sampled in order to ensure that the 
inference of the patterns is accurate. To do so, we as- 
sume that the examples <r b are drawn independently and 
at random from the equilibrium probability Ph © of a 
Hopfield model, with fixed fields h and patterns We 
call 5[{er 6 }] the entropy of the posterior distribution P 
§5§ for the fields h and patterns £. In the large N limit, 
we expect this entropy to be self-averaging, that is, to 
depend on the set of examples only through their num- 
ber B. We want to determine how fast S decays with 
B. To do so it is instructive to first consider the simple 
case where the local fields are known, and only one pat- 
tern has to be inferred. This specific situation is treated 
in great analytical details in Section fVI Al The general 
(and harder) case where both fields and patterns have to 
inferred is treated in Section IVIBI 

A. Case of one unknown pattern and known fields 

Throughout this Section, we assume that the local 
fields vanish, h = and that the number of patterns 
to be inferred is p — 1. The posterior entropy, 



s[W b }} = - E p[o,Z\W 

{Si=±£} 



}] logP[0,||{<T b }] , (79) 



therefore measures the uncertainty about this unique pat- 
tern given a set B sampled configurations. Intuitively, 
the dependence of S on B is closely related to the physics 
of the Hopfield model (with pattern £ and zero fields) 
used to generate the examples. If the model is in the 
paramagnetic phase, i. e. if the components of the pattern 
are weak (27j, the examples cr h have vanishingly small 
overlap (|5^|) with the pattern. We expect that a large 
number B (diverging with N) of examples is necessary 
to convey reliable information about the pattern. Con- 
versely, few configurations sampled in a ferromagnetic 
state around a strong pattern (or its opposite) should be 
sufficient to reconstruct the pattern. 

We now make this scenario quantitative in various 
cases. An important simplication arises when the pattern 
is restricted to have binary components, £ = = ±£}, 
with £ > 0. Hamiltonian (J3|) with p — 1 pattern is invari- 
ant under the exchange of the spin configuration and the 
pattern: _E[er,0,£] = E[£,0,cr]. Our inference problem 
can thus be mapped onto a dual Hopfield model, where 
the normalized inferred pattern, plays the role of 
the dual spin configuration and the sampled spin con- 
figurations, cr b , b = 1, . . . , B correspond to the B dual 
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FIG. 17: Entropy of the posterior distribution for the pat- 
terns, S (in bits and per component), as a function of the 
number of sampled configurations, B, when the local fields hi 
are known to vanish, (a). Ferromagnetic regime (£ 2 = 1.1): 
the entropy decays exponentially with B. Inset: compari- 
son with the theoretical prediction exp(—B/B c ) (dashed line), 
with B c — 6.85, in semi-log scale, (b). Paramagnetic regime 
(i 2 = .5): 5" (US} is a decreasing function of a = B/N. The 
entropies calculated from numerical calculations are shown for 
N = 10 and N = 20. Inset: the overlap r ((83j) between the 
inferred and true patterns is positive when a exceeds a c — 1 
<|8TJ). 



patterns. In particular, the posterior entropy S is equal 
to the entropy of the dual Hopneld model at inverse tem- 
perature 

p = ~e . (so) 

The duality property allows us to exploit the well- 
understood physics of the Hopneld model [13] to simplify 
the study of our inference problem. 

1. Strong components 

In the ferromagnetic regime (£ > 1), the dual spin 
configuration is strongly magnetized along the dual pat- 



terns. Going back to the inference problem, we find that 
the overlap between the inferred pattern and a sampled 
configuration, 

q b = m^^n^^E^ 1 ' ( 81 ) 

{v b },£ b i 

may take values +q or — q, where q is the positive root 
of q = tanh(q£ 2 ). The sign of the overlap q b is random, 
depending on which one of the two states with oppo- 
site magnetizations the configuration <x b in sampled in; 
it is equal to + or — with equal probabilities \. These 
statements hold if the thermodynamical limit, N — > oo, 
is taken while B is kept fixed. We find that S is equal 
to the entropy of a single spin at inverse temperature /3, 
interacting with B other spins of magnetization q, 

(82) 

where S(u) = log(2 coshu) — u tanhu. Figure [TTR shows 
that the entropy is almost a pure exponential: log S ~ 
—B/B c where the decay constant, B c — 1/ logcosh(g£ 2 ), 
is finite (compared to AT). In the ferromagnetic regime 
few sampled configurations are sufficient to determine £ 
accurately. 

This result also applies to the case of a single ferro- 
magnetic state. If the field h does not strictly vanish 
and explicitely breaks the reversal symmetry between the 
two states, all configurations are sampled from the same 
state, with probability 1 — exp(— 0(N)). Remarkably, 
expression (|82| for the entropy still holds. Again we find 
that B = 0(1) configurations are sufficient to infer the 
pattern. We will discuss in more details the inference in 
the ferromagnetic regime in Sections IVI B II and IVI B 31 

2. Weak components 

In the paramagnetic phase (£ < 1), the overlap (|8T|) 
between the inferred pattern and an example is typically 
very small, q ~ N^ 1 / 2 . No inference is possible unless the 
number of examples, B, scales linearly with N; we denote 
a = B/N. In this regime, we expect the entropy to bc 
self-averaging: S'[{cr h }] does not depend on the detailed 
composition of the data set and is a function of the value 
of the macroscopic parameters, e.g. the ratio a, only. To 
calculate this function S we use the replica method 0, 
[27} • We report below the results of the replica symmetric 
calculation; technical details can be found in Appendix. 
The order parameter is the average overlap r between the 
inferred and the true patterns, 

{cr b },i b i 

which is solution of the self-consistent equation 

/oo 
Dz tanh(z^7 + 7) , (84) 
-OO 
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where Dz 



■j=e z / 2 is the Gaussian measure, and 



aj3 2 r 



(l-P)(l-p + pr) 



(85) 



S = 



The posterior entropy is equal to 

/oo 
Dz log 2 cosh(z % /7 + 7) - - log(l ~ (3 + /3r) 
-oo ^ 

q/3(l -p-r + Z$t) 

2(l-/3)(l-/3 + /3r) ' [ ' 

and is plotted in Fig. ITTB. To check this analytical pre- 
diction we have run extensive numerical simulations on 
small-size systems (N — 10,20). The numerical proce- 
dure follows three steps: 1. evaluate the partition func- 
tion Z in © through an exact enumeration; 2. generate 
a data set of B — aN configurations {of} according to 
the Hopfield measure Ph by rejection sampling; 3. evalu- 
ate Pi in §5$ and S in (|75|) through exact enumerations. 
The resulting entropy, averaged over one hundred data 
sets, is compatible with the analytical prediction and the 
existence of -h finite-size effects. 

Inset of Fig. [TTB shows that the overlap r remains null 
until a reaches the critical value 



(87) 



Hence, in the range [0;a c ], the posterior probability be- 
comes more concentrated (S decreases), but not around 
the true pattern £. The existence of a lagging phase be- 
fore any meaningful inference is possible is similar to the 
'retarded learning' phenomenon discovered in the field of 
unsupervised learning, where the variables to be learned 
are real- valued [28l430l |. In the present case of binary 
spins we expect the replica symmetric assumption to 
break down at large a. The entropy (|86l) indeed be- 
comes negative when a > «o — 42 for the case studied 
in Fig. ITTB. Nevertheless we may conjecture that the en- 
tropy decays as S ~ — when a — > oo. The dual Hopfield 
model has random couplings Jij, with second moment 
equal to (Jfj) — (Ji 3 ) 2 = Hence T — -j- sets the 
temperature scale of the dual model. The low tempera- 
ture scaling of the entropy of the Sherrington-Kirkpatrick 



(SK) model suggests that S oc T 2 [3l|; this scaling is 
compatible with the small-TV results of Fig. [TTB. How- 
ever the dual and SK models are not strictly identical 
when a — > oo: the coupling matrix J of the dual model 
is guaranteed to be semidefinite positive, while the en- 
tries of J are independent in the SK model. A complete 
calculation of the entropy valid for any (large) a would 
require a re plic a symmetry broken Ansatz for the order 
parameters [32j, and is beyond the scope of this article. 

Note that the calculations above can be extended to 
real patterns; f3 in (150)) is then replaced with (£ 2 ), where 
the average is taken over the pattern components. The 
entropy is not constrained to be positive as in the bi- 
nary case. The distinction between the strong- and weak- 
component regimes remains qualitatively unchanged, and 



so does the value of the critical ratio a c (|8T[) , which does 
not depend on the third and higher moments of £j. 



B. General case of unknown patterns and fields 

In this Section, we first interpret the above results. 
We show that, while B — 0(1) configurations can be 
sufficient in a particular context, B = O(N) data are 
generally necessary for the inference to be sucessful. The 
connection between the results of Section IVI Al and ran- 
dom matrix theory are emphasized. 



1. Inference from the magnetizations 

Consider first the case where a single state exists, i.e. 
equations (|53|) admit a single solution {g^}; the case 
where states coexist will be discussed in Section fVI B 31 
For large N, the average value of spin i with the measure 
© is 



tanh (hi 



(88) 



As the error on the estimate of mt decreases as <• 
with B, 0(1) configurations are sufficient to sample the 
magnetizations accurately. Few sampled configurations 
therefore give access to the knowledge of a linear combi- 
nation of the field vector and pattern vectors with non 
zero-overlaps . This linear combination is simply T\ ■ , 
and equation (155)) coincides with (1551) . 

When the fields hi are known and the model consists of 
a single strong pattern (j> = 1) the pattern components 

can be readily calculated from the magnetizations (|55|) 
through 

-i 1 ^ 2 1 ^ — ^ ^ 

Ei = — tanh mi where q = — > m., tanh to,- . 
q N t— 1 J J 

(89) 

This particular case was encountered at the end of Sec- 
tion [VTXU when the fields hi are sent to zero after hav- 
ing broken the reversal symmetry of the system to avoid 
state coexistence. In the generic situation of unknown 
fields and patterns, knowledge of the magnetizations does 
not suffice to determine the field and the patterns, and 
must be supplemented with the information coming from 
the correlation matrix Tu. 



2. Inference from the correlations: relationship with 
random matrix theory 

What is the order of magnitude of IY; ? We first con- 
sider the ideal case of perfect sampling (B —> oo while 
N is large but finite). As a result of the presence of the 
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patterns in the energy ([3]) the spins are correlated. The 
entries of the correlation matrix are, for large N ,42], 



1 ^yCl-mfXl-m; 



(90) 



where we have considered the case of a single pattern 
(p = l,p = 0) to lighten notations. Though the pattern 
affects each correlation by O(-k) only, these small 
contributions add up to boost the largest eigenvalue from 
one (in the absence of pattern) to 



(91) 



i-^E^i(i-m*) 2 



The eigenv ector attached to L has components v\ oc 



and ML inference perfectly recovers the pat- 



6vT 

tern. 

In the presence of sampling noise (finite B), each corre- 
lation (|90|) is corrupted by a stochastic term of the order 
of x = -4?. This stochastic term will, in turn, produce 

v B 

an overall contribution of the order of x^f~N ~ -7= to 
the largest eigenvalue. Intuitively, whether a is large or 
small compared to L~ 2 should tell us how hard or easy it 
is to extract the pattern £ from V . Several studies in the 
physics [H, Hil and in the mathematics [H[ literatures 
have indeed found that an abrupt phase transition takes 
place at the critical ratio 



1 



{L-lf 



(92) 



It is a simple check that a c coincides with the ratio (|87j) 
for the retarded learning transition calculated in Sections 
IVIA2I 

In the strong noise regime (a < a c ) the largest eigen- 
vector v 1 of r is uncorrelated with (orthogonal to) the 
pattern and the spectrum of T is identical to the one 
of the sample correlation matrix of independent spins, 
whose density of eigenvalues is given by the Marcenko- 
Pastur (MP) law, 



Pmp(X') = v(l-a)S(\')+^ V / W ((A + -A')(A'-A_)) 

(93) 

with v(u) = max(w, 0) The edges of the continuous 
component of the MP spectrum are given by 



A+ = 1 - 



(94) 



The largest eigenvalue of T, A + , is not related to the value 
ofi. 

In the weak noise regime (a > a c ) the largest eigen- 
value of T is [33 



A 1 = L 1 



1 



a(L-l) 



(95) 



It exceeds L for any finite a, and converges to L when 
a — > 00. The rest of the spectrum is described by the 
MP density (|93| . Expression (f74| for the squared norm 
b 1 of the orthogonal fluctuations leads to the analytical 
formula 



* = I 

a 



X- 



+ ,w PMP(X') 



X — L 
A 1 



(96) 



where we have used the analytical expression of the 
Stieltjes transform of pmp [HI- Using (|75|) we deduce the 
value of the squared projection of the inferred rescaled 
pattern onto v 1 , 



L 



A 1 



(97) 



Identities (|96p and (1^71) are graphically interpreted in 
Fig. [TJ b 1 is the squared norm of the orthogonal fluctua- 
tions /3, while a 1 is the squared projection of the rescaled 
pattern £ onto v . 

The above discussion is illustrated on the simple case 
of a Hopfield model with p — l,p = patterns in Fig. [THl 
see caption for the description of the model. Using for- 
mula (|9"T|) we compute the largest eigenvalue of the cor- 
relation matrix for perfect sampling, L = 2. Figure IT51 
shows that a large eigenvalue clearly pulls out from the 
bulk spectrum for the ratio a = 4 (top spectrum) , larger 
than the critical ratio a c = 1 according to ([9"2"]l (bot- 
tom). For a = 4, the infinite- N predicted values for the 
largest eigenvalue, Ai = 2.5 ([95]), and for the edges of 
the MP spectrum, A_ = .25, A+ = 2.25 ([Ml), are in good 
agreement with the numerical results for N = 100. 

Formulae and (|§7| hold for each pattern /1 when 
p > 2 patterns are present, provided that p remains fi- 
nite when N — > 00. The case of p = 2 patterns, where 
one pattern is strong and has overlap q > (|81[) with 
the sampled configurations, and the second pattern has 
weak components, is of particular interest. Again, we as- 
sume that the fields vanish. Repeating the calculation of 
Section IVl A 21 and Appendix A we find that the entropy 
S/N quickly decreases with B from 2 bits down to 1 for 
B = O(l). When B oc N, the entropy decreases from 1 
down to 0; the expression of S coincides with where 
f3 is replaced with /3(1 — q 2 ). Hence we have a two-step 
behaviour: the strong pattern is determined with 0(1) 
examples, the weak pattern requires O(N) sampled con- 
figurations. Learning of the weak pattern is possible if 



a > 



1 



- 1 



(98) 



according to (I57|) . The two-step behaviour agrees with 
the discussion of Section I VI B II 



3. Coexistence of ferromagnetic states 

Consider now the case of the coexistence of two ferro- 
magnetic states exposed in Section IV Bl Data are gen- 
erated from a Hopfield model, with zero fields and one 
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FIG. 18: Spectrum of the correlation matrix for a Hopfield 
model with p = 1 pattern, iV = 100 spins, and for B = 100 
(bottom) and 400 (top) randomly sampled configurations at 
equilibrium. The bulk parts of the spectra coincide with the 
Marcenko-Pastur law for random correlation matrices. When 
B is large the top eigenvalue clearly comes out from the noisy 
bulk and the corresponding eigenvector approximately cor- 
responds to the pattern. The pattern components are i.i.d. 
Gaussian variables, of zero mean and variance £ 2 = .5; local 
fields hi have zero values. 



strong pattern £, as in Fig. |4] In the up-state the spins 
are magnetized with = tanh(g^). In the down-state 
the local magnetization is n\ = — mf. On the overall 
the local magnetization is mi = \ m t I m 7 = 0' U P ^° 
0(-4=) fluctuations. The discrepancy between the Gibbs 
magnetizations, m; = 0, and the state magnetizations, 



m i , results in a 0{1) contribution mTm^ (= m i rrij ) to 
the correlation matrix entry Fy, dominating the O(jj) 
contributions due to the interactions between spins. The 
largest eigenvalue of T, 



,+^2 



(99) 



is of the order of N; the corresponding eigenvector is 



or 



)/vA 1 . Informally speaking, the 



information about the state magnetizations is not con- 
veyed by the Gibbs magnetizations (as in Section FVIB ip 
but by the correlation matrix (36j . According to formula 
(|55[) the pseudo-magnetization Ti vanishes; hence we cor- 
rectly infer that the fields hi have zero values. Using 
formula ^ we obtain 



(100) 



Therefore, the inferred pattern component is not equal 
to the true pattern component, but is proportional to its 
hyperbolic tangent. This non linear transform is clearly 
seen in Fig. 2J The discrepancy between the true and 



inferred components is a nice illustration of the claimed 
scaling for the higher order corrections in ([4U]) (recall that 
the eigenvalues of A^ 1 are the p largest eigenvalues of 
r). In the presence of coexistent states, while £ 2 is small 
compared to TV, A 1 is of the order of N, making the ratio 
—S- of the order of unity. Corrections are required and 
shown to improve the quality of the inferred pattern in 
Fig.DU 



VII. CONCLUSION 

In this paper we have studied how to infer a small-rank 
interaction matrix between TV binary variables given the 
average values and pairwise correlations of those vari- 
ables. We have seen that the generalized Hopfield model, 
where the interactions are encoded into a set of attractive 
and repulsive patterns is a natural framework for Max- 
imum Likelihood (ML) inference. Using techniques from 
the statistical physics of disordered systems, we have pre- 
sented a systematic expansion of the log-likelihood in 
powers of A^, where A is the largest eigenvalue of the 
correlation matrix T ([T]). We have then calculated the ML 
estimators for the patterns and the fields to the lowest 
and first order in this expansion in a variety of physical 
regimes. The lowest order is a simple extension of Princi- 
pal Component Analysis, where not only the largest but 
also the smallest eigenmodes build in the interactions. 
First order corrections involve non-linear combinations 
of the eigenvalues and eigenvectors of T. We have vali- 
dated our ML expressions for the patterns on synthetic 
data generated by Hopfield models with known patterns 
and fields, and by Ising models with sparse interactions. 
We have also presented a simple geometrical criterion 
for deciding the number of patterns. Those results have 
been discussed and compared to previous studies in the 
unsupervised learning and random matrix literatures. 

The quality of the inference strongly depends on the 
number of sampled configurations, B. The sampling er- 
ror on each magnetization, rm, and pairwise correlation, 
Cjj, is of the order of B~ x / 2 . Elementary insights from 
random matrix theory suggest that the resulting errors 
on the eigenvectors of the matrix T are s/N times larger. 
The error on the inferred patterns, e, picks up a con- 
tribution ~ (jj) 1 ^ 2 due to finite sampling, as found in 
Section III CI This scaling has several important con- 
sequences. First, inference is retarded: no information 
about the true couplings can be obtained unless the ratio 
^ exceeds a critical value (Sections IVI A 21 and IVI B~2|) . 
Secondly, for larger B, e decreases as 73 -1 / 2 , which is 
confirmed by the simulations presented in Fig. I12| and 
then saturates to the intrinsic error resulting from our 
approximate expressions for the patterns. The intrinsic 
error depends on the order in the expansion used for the 
calculation of the cross-entropy in Section [V] Note that 
other inference methods, loo king for the local structure of 
the interaction network [ill . Il2j | may unveil strong cou- 
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plings J = 0(1) from a much smaller number of sampled 
configurations, B = 0(log N), and do not suffer from the 
retarded learning transition. 

Our study could be extended in several directions. It 
would be particularly interesting to consider the case of 
spins taking Q > 2 values (Potts model), e.g. for appli- 
cations to the study of coevolution between residues in 
protein sequences [23|, [25|, |37j • Mean- field inference meth- 
ods provide a simple and efficient way to get interactions 
from correlations (38[. Knowing how MF interactions 
are modified when some eigenmodes are rejected (using 
the criterion of Section III D|) or first-order corrections are 
taken into account would be of interest. However the lin- 
ear increase in the number of possible symbols with Q 
(= 20 for amino-acids) may make the effective size of 
the problem, N x Q, larger than the number of config- 
urations, B, in practical applications. A large number 
of vanishing eigenvalues is expected in those cases, and 
extracting repulsive patterns may become a difficult task. 

Appropriate priors Pq could also be used to force many 
pattern components to identically vanish, instead of ac- 
quiring small values as in Section III El This can be par- 
ticularly useful when the true patterns are known to be 
highly sparse and few data are available. Inspired by the 
so-called Lasso regression method [39| , a natural prior is 



Po cx exp 



"7 



JV 

E 



i 



i—l \f-i— 1 

(101) 

Contrary to the case of the quadratic penalty ([21]) the 
most likely values for the patterns cannot be expressed 
by means of simple analytical formulae. However, they 
could be efficiently obtained using convex optimization 
algorithms minimizing the sum of the cross entropy and 
of the penalty term (|101[) . 

Last of all, we have considered in this work that the 
configurations were sampled at equilibrium. In practice, 
when more than one state exist, the equilibration time 
may be prohibitive and a reasonable assumption would 
be to sample from one state only. To what extent er- 
godicity breaking in the sampling affects the quality of 
inference is an interesting question. 
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partition function (f35|) of the Hopficld model through 



E 



exp 



N 



E-^ 



2N 



(Al) 



where the inverse temperature /3 is defined in (|80l) . The 
partition function is thus independent of the pattern di- 
rection, which makes the calculation considerably sim- 
pler. The posterior entropy (fT5|) can be written as 



S\{a''y_=[l-^)^Xia 1 '}. i\. <A2) 



where 



N[{a b }, 0\ = J2 [ f E E 



{€} 



(A3) 



6=0 i<3 



Thus, we are left with the calculation of N[{cr b }}. The 
expression for N is formally identical to the partition 
function of a dual Hopfield model where the B measured 
configurations cr b play the role of the dual patterns and 
£ plays the role of the dual spin variables. The posterior 
entropy S is simply the entropy of this dual Hopfield 
model. 



Equation (|A2[) gives the entropy of the system for a 
particular set of measures {cr b }. It is natural to ex- 
pect the entropy to be reproducible across different sets 
of measurements. In this context, we are interested in 
evaluating the average of the entropy with respect to all 
possible measurements. Assuming that the configura- 
tions {cr b } are sampled from the equilibrium measure of 
a Hopficld model with one pattern £, we write the aver- 
age entropy as 
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where 



(logjVX^/3) = 
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Appendix A: Replica calculation of the entropy S 
for weak patterns 

When the pattern has binary components £j = ±£ we 
make the change of variables a[ = £j<7j to rewrite the 



where we have introduced a new variable j3 since we 
should not take the derivative only with respect to /3 
in (|A"4"1) . 

To calculate the average value of the logarithm of N 
in (|A5|) we use the replica trick [27j and estimate the n th 
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moment of N, 
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We introduce auxiliary Gaussian variables, denoted by 
rhb, to linearize the quadratic term in the spins u\. We 
obtain, after summation over the spins, 
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(A7) 
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In the paramagnetic phase we expect the variables m£ 
and rhb to be of the order of . Expanding the hyper- 
bolic cosine to the second order in those variables and 
carrying out the resulting Gaussian integral we obtain 



(N n ) ~ e- 0Bn l 2 Yj I det M \ 



-B/2 



(A8) 



Here, M is the (n + 1) x (n + 1) matrix with elements 

P = a < P , 



Mr, 



1-/3 if 

1 — /3 if p = a = p + 1 , 

/3/3t CT if p = p+l, ct<p, (A9) 

/3^t p if p<p, a=p + l 

fir pa 



with the overlaps defined through r pa — -k JT and 

*p = "^Si^fCi- We now enforce the definitions of the 
overlaps using conjugated Lagrange multipliers, r p<7 and 



i p , and obtain 



(N n ) = 
where S is given by 
S = ^ exp 
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We look for a replica-symmetric saddle point of S: r pcr = 
i ; ^po- = r and t„ — t. We obtain, after some 



r, t p 

elementary algebra, 
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(A12) 



where Dz — dze z I 2 / v / 2~7r is the Gaussian measure and 

detM = {l-p + pr) n - 1 [(l-£)(l-/3) 

- (n-l)(l- P)(3r-n/3pt 2 ] . (A13) 

We now send n to zero. The saddle-point equations show 
that t = r; this result was expected from the fact that, 
if (3 — f3, the true pattern £ plays the role of an extra 
replicated pattern £. In addition, t = f = 7, where 7 is 
defined in (|85]). The self-consistent equations for r and 
the entropy S are given by, respectively eqns (|84|) and 
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